This notebook aims to find insights in a set of restaurant reviews from Yelp.
import sys
# !{sys.executable} -m pip install -r requirements.txt
import pandas as pd
import dask
import dask.dataframe as ddf
import numpy as np
import json
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.notebook import tqdm
import plotly
import plotly.express as px
import matplotlib.pyplot as plt
from wordcloud import WordCloud, get_single_color_func
from dotenv import load_dotenv
import requests
tqdm.pandas()
plotly.offline.init_notebook_mode()
load_dotenv();
# Preparing spacy model for tokenization
model = 'en_core_web_lg'
if model not in spacy.util.get_installed_models():
    !{sys.executable} -m spacy download {model}
nlp = spacy.load(model)
filepath = 'yelp_dataset/yelp_academic_dataset_review.json'
The file is too big to be opened directly (the kernel crashes), so let's first have a look at its first line:
with open(filepath, 'r') as source:
    sample = source.readline()
sample
'{"review_id":"lWC-xP3rd6obsecCYsGZRg","user_id":"ak0TdVmGKo4pwqdJSTLwWw","business_id":"buF9druCkbuXLX526sGELQ","stars":4.0,"useful":3,"funny":1,"cool":1,"text":"Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating. \\n\\nAfter reading the mixed reviews of late I was a little tentative over our choice but luckily there was nothing to worry about in the food department. We started with the fried dough, burrata and prosciutto which were all lovely. Then although they don\'t offer half portions of pasta we each ordered the entree size and split them. We chose the tagliatelle bolognese and a four cheese filled pasta in a creamy sauce with bacon, asparagus and grana frita. Both were very good. We split a secondi which was the special Berkshire pork secreto, which was described as a pork skirt steak with garlic potato purée and romanesco broccoli (incorrectly described as a romanesco sauce). Some tables received bread before the meal but for some reason we did not. \\n\\nManagement also seems capable for when the tenants in the apartment above began playing basketball she intervened and also comped the tables a dessert. We ordered the apple dumpling with gelato and it was also quite tasty. Portions are not huge which I particularly like because I prefer to order courses. If you are someone who orders just a meal you may leave hungry depending on you appetite. Dining room was mostly younger crowd while the bar was definitely the over 40 set. Would recommend that the naysayers return to see the improvement although I personally don\'t know the former glory to be able to compare. 
Easy access to downtown Salem without the crowds on this month of October.","date":"2014-10-11 03:34:02"}\n'
What we need to collect is the 'text' field of each review. How many reviews do we have? Since we can't load them all directly, we count them indirectly:
nb_of_lines = 0
with open(filepath, 'r') as source:
    while True:
        line = source.readline()
        if not line:
            print('\nEnd.')
            break
        else:
            nb_of_lines += line.count('"text"')
            if nb_of_lines % 1000 == 0:
                print(nb_of_lines, end='\r', flush=True)
print(f'There are {nb_of_lines} reviews.')
8635000 End. There are 8635403 reviews.
We'll work with a dask DataFrame to get a sample of ~0.1% of the reviews (about 8,600) out of all the data.
sample_proportion = 1/1000
all_reviews = ddf.read_json(filepath, blocksize=2**22)
reviews = all_reviews.sample(frac=sample_proportion)
Here are the first lines of this sample.
reviews.head()
| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 3732 | sBRN9MfFBn19AXoMmoYxJQ | iR3hyz_EoKQ740g20tEmag | k4zJfURHmAWRMECfGS5fxQ | 5 | 0 | 0 | 0 | Sushi Katana is the perfect place for your Jap... | 2013-09-04 03:43:01 |
| 1139 | jysluSPNaYyUlLiDMFnZ6A | xBvaQwMU9Gsh3SYBa6CTWQ | nLYPM9DDqmOG9cZTvnCTOA | 3 | 0 | 0 | 0 | The portion sizes are HUGE. You are sure to g... | 2006-04-29 17:46:44 |
| 41 | yXKu-60gP_378PX0xzyHHg | uUrXZ2guG27PQUeR6u8K3w | WtDOs3a6k_oPJmwiDh4qBQ | 2 | 3 | 1 | 2 | I wanted this to be a great place, but I wasn'... | 2009-02-28 22:47:35 |
| 3477 | aAiYtmhWNJp8uAYC3jDqwg | n5sL_4DqsLOS5iQncWNcXA | --164t1nclzzmca7eDiJMw | 5 | 0 | 0 | 0 | Incredibly good food, and incredibly good serv... | 2012-07-16 13:42:38 |
| 218 | YnwoFB_QSeAMyBL37kpu9g | eaq31izc4yUnDzlvMvEanQ | -IIvmjoEKa9Trhf4OxzJeA | 5 | 0 | 0 | 0 | This place is amazing. From the staff, to the ... | 2017-09-20 14:29:34 |
To process the reviews, our computer first needs to convert them into numbers, since numbers are the only thing it can handle. But how can we achieve this? One way is to list all the words in all the reviews to build a vocabulary, then count how many times each word of the vocabulary appears in a review. This is called a "bag of words" ("bag", since the order of the words is not taken into account). We can then try to automatically group reviews that use the same words and infer that they are about the same topic.
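The idea can be sketched with a few lines of standard-library Python (the two toy reviews below are made up for illustration):

```python
from collections import Counter

# Two hypothetical reviews standing in for the Yelp data.
toy_reviews = [
    "great food and great service",
    "food was cold, service was slow",
]

# Build the vocabulary: every distinct word across all reviews.
tokens = [review.replace(",", "").split() for review in toy_reviews]
vocabulary = sorted(set(word for words in tokens for word in words))

# Represent each review as a vector of word counts; word order is lost,
# hence the name "bag" of words.
vectors = [[Counter(words)[word] for word in vocabulary] for words in tokens]
print(vocabulary)  # ['and', 'cold', 'food', 'great', 'service', 'slow', 'was']
print(vectors[0])  # [1, 0, 1, 2, 1, 0, 0]
```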
The bag of words (BOW) is efficient but has at least two flaws: very frequent words (such as "food" or "place") dominate the counts while carrying little information, and words that are distinctive of only a few reviews are drowned out. A natural fix is to weight each word by how rare it is across the whole corpus.
To achieve this, we first compute the frequency of each term across all the reviews, then take its inverse, so that the more frequent a word is, the lower this value. Then, for each review, the frequency of each term within the review (say, the word 'chicken') is multiplied by the inverse frequency of that word across the whole corpus. Since not all reviews talk about chicken, the final weight of 'chicken' in a review that mentions it may be high enough to suggest that it is an important topic of that review.
This double weighting is called TFIDF (Term Frequency - Inverse Document Frequency).
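The weighting can be illustrated by hand on a toy corpus (standard library only; scikit-learn's TfidfVectorizer adds smoothing and normalization, so the exact numbers differ, but the idea is the same):

```python
import math
from collections import Counter

# Three hypothetical reviews, already tokenized.
toy_docs = ["chicken rice chicken", "rice soup", "rice noodles"]
toy_tokenized = [d.split() for d in toy_docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(w for words in toy_tokenized for w in set(words))

def tfidf_weight(word, words):
    tf = words.count(word) / len(words)                  # term frequency in this review
    idf = math.log(len(toy_tokenized) / doc_freq[word])  # inverse document frequency
    return tf * idf

# 'rice' appears in every document, so its idf is log(3/3) = 0 and its
# weight vanishes, however often it occurs; 'chicken' appears in a single
# document, so it gets a high weight there, flagging it as a likely topic.
print(tfidf_weight('rice', toy_tokenized[0]))     # 0.0
print(tfidf_weight('chicken', toy_tokenized[0]))  # ~0.73
```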
Before applying this process, we first need to clean the data to group close words together. For example, we want the words 'Fried', 'fry', 'fries', etc., to be grouped under a generic root 'fry', in lowercase. This process is called lemmatization.
We also need to exclude overly frequent words, such as "the", "in", etc. We could identify them manually, but Spacy has already computed the frequencies of all words in the English language and can exclude them automatically (the stop words).
And of course, we need to split the text into words, excluding punctuation: this is also something that Spacy can handle.
Tokenization in Spacy is powerful enough to handle punctuation even when spaces are missing (see the example below), so our preprocessing function can remain simple.
sentence = "I don't think so.It's really different,I've never seen this."
doc = nlp(sentence)
print(f'Tokenization of sentence "{sentence}"')
print('Result: ', end='')
for tok in doc:
    print(tok, end=' / ')
Tokenization of sentence "I don't think so.It's really different,I've never seen this." Result: I / do / n't / think / so / . / It / 's / really / different / , / I've / never / seen / this / . /
(But notice "I've", which should have been tokenized as "I / 've", since "It's" was properly tokenized as "It / 's".)
def preprocess(text):
    '''
    Return the text lowercased and lemmatized, with digits, punctuation
    and Spacy stop words removed.
    '''
    trimmed = text.replace('\n', '')
    doc = nlp(trimmed.lower())
    return ' '.join([tok.lemma_ for tok in doc
                     if not (tok.is_punct or tok.is_stop) and tok.is_alpha])
sample = reviews['text'].head(1).values[0]
Let's process our sample:
preprocess(sample)
'sushi katana perfect place japanese food experience nice food selection price right great service quality good japanese experience orlando'
Now that our reviews are cleaned (no punctuation, no stop words, lemmatized forms), we can compute the TFIDF vectors, each of them representing a review in our dataset.
docs = reviews['text']
tfidf = TfidfVectorizer(preprocessor=preprocess)
%%time
X = tfidf.fit_transform(docs)
CPU times: user 3min 24s, sys: 12 s, total: 3min 36s Wall time: 3min 38s
print(f'The vocabulary contains {X.shape[1]} words.')
The vocabulary contains 19621 words.
This is much too big. The most frequent English words were already dropped thanks to the Spacy stop words. But among the rarest words (including misspelled ones), how many would be dropped by tuning min_df (the minimum proportion of documents in which a word must appear to be kept in the vocabulary)?
Let's find out how many words remain in the vocabulary if we exclude those used (in our dataset of ~8,600 reviews) by fewer than 5 reviews, 10 reviews, 50 reviews, etc.
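The effect of min_df can be sketched without scikit-learn: compute each word's document frequency and keep only the words that reach the threshold (toy documents made up for illustration, standard library only):

```python
from collections import Counter

mini_docs = ["good pizza", "good soup", "grreat pizza"]  # 'grreat' is a typo
mini_tokenized = [d.split() for d in mini_docs]
n = len(mini_tokenized)

# Document frequency: number of documents containing each word.
mini_doc_freq = Counter(w for words in mini_tokenized for w in set(words))

# Keep only words present in at least 2 of the 3 documents.
min_df_threshold = 2 / n
kept = sorted(w for w, c in mini_doc_freq.items() if c / n >= min_df_threshold)
print(kept)  # ['good', 'pizza'] -- the rare 'soup' and the misspelled 'grreat' are dropped
```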
def compute_voc_size(min_df, docs=docs, preprocess=preprocess):
    tfidf = TfidfVectorizer(preprocessor=preprocess, min_df=min_df)
    tfidf.fit(docs)
    return len(tfidf.vocabulary_)
sample_size = nb_of_lines * sample_proportion
rare_freqs = np.array([5, 8, 13, 21, 55, 89, 144, 233, 500, 1000])/sample_size
%%time
voc_size = []
for freq in tqdm(rare_freqs):
    voc_len = compute_voc_size(freq)
    voc_size.append((freq, voc_len))
px.line(pd.DataFrame(voc_size, columns=['min_df', 'vocab. size']), x='min_df', y='vocab. size')
CPU times: user 31min 23s, sys: 1min 57s, total: 33min 21s Wall time: 33min 25s
We pick a value of 0.001 for min_df. This means that words appearing in less than 0.1% of the reviews (≈ 8 reviews in our sample dataset) will be discarded.
%%time
tfidf = TfidfVectorizer(preprocessor=preprocess, min_df=0.001)
X = tfidf.fit_transform(docs)
print('Vocabulary size:', len(tfidf.vocabulary_))
Vocabulary size: 3629 CPU times: user 3min 5s, sys: 11.7 s, total: 3min 17s Wall time: 3min 18s
We end up with a matrix representing the weighted frequency of each vocabulary word in each review.
matrix = pd.DataFrame(X.todense(), columns=tfidf.get_feature_names_out())
matrix
| | ability | able | absolute | absolutely | absurd | ac | accent | accept | acceptable | access | ... | yogurt | york | you | young | yr | yuck | yum | yummy | zero | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8267 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8268 | 0.0 | 0.071988 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8269 | 0.0 | 0.073262 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8270 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.121478 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8271 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8272 rows × 3629 columns
Let's compare the first line of the matrix with the first sample, from which it was created:
words = matrix.loc[0]>0
words = words[words>0].index
preprocessed_sample = preprocess(sample)
preprocessed_sample
'sushi katana perfect place japanese food experience nice food selection price right great service quality good japanese experience orlando'
print(list(words))
['experience', 'food', 'good', 'great', 'japanese', 'nice', 'orlando', 'perfect', 'place', 'price', 'quality', 'right', 'selection', 'service', 'sushi']
Can all the words be found in the sample?
all([word in preprocessed_sample for word in words])
True
What proportion of the sample's words is left after preprocessing?
print("{:.2%}".format(len(words)/len(preprocessed_sample.split())))
78.95%
def format_reviews_for_hovering(text):
    '''
    Insert "<br>" in place of every 10th space in the text so that
    it can be fully displayed in the Plotly scatter plot hover box.
    '''
    words = text.split()
    result = ['']
    for word in words:
        if len(result) % 10 == 0:
            result.append('<br>' + word)
        else:
            result.append(word)
    return ' '.join(result).strip().replace(' <br>', '<br>')
formatted_reviews = docs.apply(format_reviews_for_hovering, meta=('text', 'object')).compute()
stars = reviews['stars'].compute()
def plot_scatter_with_reviews(projection, reviews=formatted_reviews, color=stars):
    '''
    Display a scatter plot of the projection, with the review texts
    in the hover data and colors according to the star rating.
    '''
    projection_df = pd.DataFrame(projection)
    projection_df = projection_df.rename(columns={i: str(i) for i in projection_df.columns})
    projection_df['review'] = reviews.values
    if projection_df.shape[1] == 3:
        display(px.scatter(projection_df, x='0', y='1', hover_data=['review'],
                           color=color, opacity=0.7, height=800, width=800,
                           labels={'color': 'stars'}))
    elif projection_df.shape[1] == 4:
        display(px.scatter_3d(projection_df, x='0', y='1', z='2', hover_data=['review'],
                              color=color, opacity=0.5, height=600, width=800,
                              labels={'color': 'stars'}))
A first attempt to apply PCA to our sparse matrix X gave the following error message:
TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.
Extract of the sklearn doc for sklearn.decomposition.TruncatedSVD:
Dimensionality reduction using truncated SVD (aka LSA).
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
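What "truncated" means can be sketched with NumPy on a small dense matrix: keep only the top-k singular vectors of the full SVD and project onto them (TruncatedSVD uses a randomized solver that works on sparse input, but this is the decomposition it approximates; M below is a made-up stand-in for our tf-idf matrix):

```python
import numpy as np

rng = np.random.default_rng(42)
M = rng.random((8, 5))  # stand-in for a small dense document-term matrix

# Full SVD: M = U @ diag(s) @ Vt, singular values s sorted in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Truncation: project the documents onto the first k right-singular vectors.
k = 2
proj = M @ Vt[:k].T
assert proj.shape == (8, k)

# Equivalently, the projection is the first k left-singular vectors
# scaled by their singular values.
assert np.allclose(proj, U[:, :k] * s[:k])
```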
from sklearn.decomposition import TruncatedSVD
tsvd_2d = TruncatedSVD(n_components=2, random_state=42)
tsvd_2d_projection = tsvd_2d.fit_transform(X)
d1, d2 = tsvd_2d.explained_variance_ratio_
print(f'The first 2 components explain respectively {d1:.4%} and {d2:.4%} of total variance.')
The first 2 components explain respectively 0.3941% and 0.7588% of total variance.
This is low but remember that our initial X matrix has many features!
size = X.shape[1]
print(f'On average, each of {size} features explains {1/size:.4%} of variance.')
On average, each of 3629 features explains 0.0276% of variance.
plot_scatter_with_reviews(tsvd_2d_projection)
The shape of this scatter plot evokes trajectories of particles expelled from a single point at (0, 0).
Moreover, we notice that the review scores seem to be reflected in the data (worst scores at the top of the scatter plot), even though the training and the dimensionality reduction were performed exclusively on the texts, not the scores. The scores were only added as colors in the plot after the reduction was computed.
And with 3 dimensions?
tsvd_3d = TruncatedSVD(n_components=3, random_state=42)
tsvd_3d_projection = tsvd_3d.fit_transform(X)
d1, d2, d3 = tsvd_3d.explained_variance_ratio_
print(f'The first components explain respectively {d1:.4%}, {d2:.4%} and {d3:.4%} of total variance.')
The first components explain respectively 0.3941%, 0.7587% and 0.6060% of total variance.
plot_scatter_with_reviews(tsvd_3d_projection)
What are these components made of?
components_3d = pd.DataFrame(tsvd_3d.components_, columns=matrix.columns).T
for component in components_3d.columns:
    sorted_df = pd.DataFrame(components_3d[component].sort_values(ascending=False))
    print(f'\nComponent #{component+1}')
    print('highest variance')
    display(sorted_df.head(12).T)
    print('lowest variance')
    display(sorted_df.tail(12).T)
Component #1 highest variance
| | food | good | place | great | time | service | come | order | like | go | love | try |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2376 | 0.235244 | 0.205358 | 0.198078 | 0.160578 | 0.157787 | 0.147746 | 0.146628 | 0.137496 | 0.12064 | 0.115957 | 0.110505 |
lowest variance
| | costco | inconvenient | overnight | consult | perspective | expertise | technique | consultation | subject | medication | identify | verify |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000774 | 0.000766 | 0.000765 | 0.000741 | 0.000734 | 0.000725 | 0.000723 | 0.000715 | 0.000698 | 0.000653 | 0.000647 | 0.000541 |
Component #2 highest variance
| | car | tell | time | work | say | go | ask | call | customer | need | minute | take |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.185578 | 0.157254 | 0.146001 | 0.140285 | 0.127519 | 0.125611 | 0.117755 | 0.116513 | 0.108068 | 0.107668 | 0.103444 | 0.101028 |
lowest variance
| | atmosphere | sauce | fresh | restaurant | fry | place | love | delicious | chicken | great | good | food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.068572 | -0.07938 | -0.086185 | -0.087869 | -0.097008 | -0.099048 | -0.100408 | -0.122928 | -0.137112 | -0.178229 | -0.213763 | -0.274214 |
Component #3 highest variance
| | great | service | friendly | staff | recommend | love | highly | price | place | food | atmosphere | amazing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.526145 | 0.21098 | 0.170698 | 0.158777 | 0.141351 | 0.137579 | 0.096272 | 0.087148 | 0.079774 | 0.079686 | 0.079575 | 0.077708 |
lowest variance
| | salad | wait | table | taste | like | burger | eat | sauce | fry | pizza | chicken | order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | -0.073808 | -0.083485 | -0.086788 | -0.087566 | -0.087588 | -0.088471 | -0.09404 | -0.107321 | -0.113074 | -0.116743 | -0.163573 | -0.318281 |
It seems that we could describe these 3 components as follows:
- Component #1: the general restaurant vocabulary (food, good, place, great, service), shared by most reviews;
- Component #2: service and transaction issues (car, tell, call, customer, minute), opposed to food-related words;
- Component #3: praise of service and atmosphere (great, friendly, staff, recommend), opposed to specific dishes and ordering.
What about the next components? Let's look at the first 10:
tsvd = TruncatedSVD(n_components=10, random_state=42)
tsvd_projection = tsvd.fit_transform(X)
components = pd.DataFrame(tsvd.components_, columns=matrix.columns).T
wc_red = WordCloud(background_color='white', color_func=get_single_color_func('coral'))
wc_blue = WordCloud(background_color='white', color_func=get_single_color_func('green'))
max_words = 12
for component in components.columns:
    sorted_df = pd.DataFrame(components[component].sort_values(ascending=False))
    # We want to display the words that contribute most to the variance,
    # negatively as well as positively.
    # For positive contributions, we select the first max_words ones.
    # For negative contributions, we first get the smallest contribution among
    # the positive ones we selected; then we keep only the words whose negative
    # contribution is bigger (in absolute value) than this threshold.
    # We do so to avoid displaying words with a contribution close to -0.
    variance_threshold = sorted_df.iloc[max_words - 1, 0]
    highest_pos_variances = sorted_df[component].head(max_words).to_dict()
    lowest_neg_variances = np.abs(
        sorted_df[component][sorted_df[component] < -variance_threshold]
        .tail(max_words)).to_dict()
    fig = plt.figure(figsize=(8, 2.5))
    fig.suptitle(f'\nComponent #{component+1}\n', fontsize=18, va='bottom')
    # highest variances
    plt.subplot(1, 2, 1)
    plt.imshow(wc_blue.generate_from_frequencies(highest_pos_variances))
    plt.title('Strongest influence\n', fontsize=14)
    plt.axis('off')
    # lowest variances
    plt.subplot(1, 2, 2)
    if lowest_neg_variances:
        plt.imshow(wc_red.generate_from_frequencies(lowest_neg_variances))
        plt.title("Strongest opposite influence\n", fontsize=14)
    else:
        plt.text(0.5, 0.5, 'No words with\nstrong opposite influence',
                 ha='center', va='center', fontsize=12)
    plt.axis('off')
    plt.show()
from sklearn.manifold import TSNE
tsne_projection = TSNE(init='random', learning_rate='auto', random_state=42, perplexity=20).fit_transform(X)
plot_scatter_with_reviews(tsne_projection)
We'll first reduce the number of dimensions with TruncatedSVD.
reduced_tsvd = TruncatedSVD(n_components=100, random_state=42).fit_transform(X)
from umap import UMAP
umap_projection = UMAP(n_components=2, random_state=42).fit_transform(reduced_tsvd)
plot_scatter_with_reviews(umap_projection)
One can notice that some blobs group reviews about hair stylists, car cleaners / retailers, or hotels, while other blobs relate to specific kinds of food. So topic modeling may allow us to identify those specific topics.
We can extract topics from our reviews. The main parameter is the number of topics: the higher the number, the more detailed the topics we get.
For instance, with more than 15-20 topics, we notice topics such as kids (cf. the section with 14 topics, topic #7) or healthcare (), which were not noticeable with a lower number of topics.
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer(preprocessor=preprocess, min_df=0.001)
counted = count_vec.fit_transform(reviews['text'])
voc = count_vec.get_feature_names_out()
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2)
lda.fit_transform(counted)
topics = pd.DataFrame(lda.components_, columns=voc).T
wc = WordCloud(background_color='white', max_words=12, colormap='brg')
for nb_topics in [2,8,15,40]:
    lda = LatentDirichletAllocation(n_components=nb_topics)
    lda.fit_transform(counted)
    topics = pd.DataFrame(lda.components_, columns=voc).T
    n_cols = 4
    n_rows = int(np.ceil(nb_topics / n_cols))
    fig = plt.figure(figsize=(15, 2.5 * n_rows))
    fig.suptitle(f"Number of topics: {nb_topics}", fontsize=16)
    for topic in topics.columns:
        ax = fig.add_subplot(n_rows, n_cols, topic + 1)
        ax.title.set_text(f'topic {topic+1}')
        weighted_words = topics[topic].to_dict()
        wordcloud = wc.generate_from_frequencies(weighted_words)
        ax.imshow(wordcloud, interpolation='bilinear')
        ax.axis("off")
    plt.show()
At first we might think that TruncatedSVD / LSA gives more interesting results than LDA since, in TruncatedSVD, the words with a negative impact on the variance help us better understand the component, while the words with the lowest scores in the LDA components are simply non-significant.
But the word clouds show that some reviews are about subjects other than food, and LDA may help us find them more easily.
To select the best number of topics, we have to look at each set of word clouds and decide manually whether it is relevant: for instance, 10 topics seems to be an interesting choice since we clearly see a 'haircut' topic, etc. But how could we measure the relevancy of this choice more objectively?
A silhouette score cannot be used here directly, since the BOW representation knows nothing about semantic similarity. We humans can confirm that it is relevant to link "hair" and "cut" because we know they are semantically related, while the only thing our BOW model knows is that these words are often seen together in our corpus, which is not the same thing.
To find the best number of components, a strategy could be to use a word embedding that takes semantic similarity into account, such as Word2vec (used by our Spacy model) or BERT, then compute a silhouette score for each 'clustering' (the number of components can be seen as a number of centroids). But this is out of the scope of this project.
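Even though it is out of scope here, the silhouette computation itself is simple to sketch; a minimal pure-Python version over 2-D points (standing in for document embeddings, each cluster assumed to contain at least two points) could look like:

```python
import math

def silhouette(points, labels):
    '''Mean silhouette score: s = (b - a) / max(a, b) averaged over the points,
    where a is the mean distance to the point's own cluster and b the mean
    distance to the nearest other cluster.'''
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(math.dist(p, q) for q in same) / len(same)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) - {labels[i]}
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated blobs score close to 1; shuffled labels score below 0.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette(points, [0, 0, 1, 1]))  # ~0.93
print(silhouette(points, [0, 1, 0, 1]))  # negative
```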